In [19]:
from IPython.display import display, HTML
display(HTML('<style>.container { width:100% !important; }</style>'))


The Natural Language Toolkit (NLTK)

This notebook introduces the Natural Language Toolkit (NLTK). Our first example is concerned with classification: we want to see whether it is possible to predict the gender of a given first name. This example is taken from Chapter 6 of the NLTK book. To begin with, we import the module nltk.


In [2]:
import nltk


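If the import fails because nltk is not installed, the package first has to be installed. One way to do this, assuming a standard pip-based setup, is directly from the notebook:


In [ ]:
%pip install nltk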

This module provides a number of built-in datasets. We will start by importing the object names from the module nltk.corpus. The dataset names consists of first names for both genders. To be more precise, names is an object of the class nltk.corpus.util.LazyCorpusLoader that provides methods to load both female and male first names.


In [ ]:
from nltk.corpus import names
type(names)

Let us load these names into lists. Since the corpus data is not bundled with nltk itself, we first have to download it.


In [ ]:
nltk.download('names')

In [ ]:
FemaleNames = names.words('female.txt')
MaleNames   = names.words('male.txt'  )
print('Number of female first names:', len(FemaleNames))
print('Number of   male first names:', len(MaleNames))

It seems that there are more female first names than male first names. Let's take a look at the first 5 female names.


In [ ]:
FemaleNames[:5]

Similarly, we inspect the male names.


In [ ]:
MaleNames[:5]

We combine these two lists into one list of tagged names, where a tagged name is a pair of the form $$ (\textrm{name}, \textrm{gender}) \quad \mbox{such that $\textrm{gender} \in \{\texttt{'f'},\texttt{'m'}\}$.}$$


In [ ]:
Names = [(n, 'm') for n in MaleNames] + [(n, 'f') for n in FemaleNames]

Our goal is to test whether it is possible to predict the gender of a given name using a Naive Bayes classifier. In order to be able to make a quantitative assessment of the accuracy of the classifier, we have to split our data into a training dataset and a testing dataset. To minimize any bias, the assignment of the names into those datasets should be done randomly. In order for our results to be reproducible, we set a seed for the random number generator. This ensures that the random number generator will always behave the same way.


In [ ]:
import random
random.seed(1)
random.shuffle(Names)
len(Names)

We assign the majority of the names to the training set. Roughly 10% of the data are assigned to the test set.


In [ ]:
train_set, test_set = Names[:7000], Names[7000:]
len(test_set) / len(Names)

Next, we need to decide which features we want to use in order to predict the gender of a name. Our first attempt uses just a single feature: the substring containing the last two characters of the name.

The classifiers that are already implemented in NLTK assume a special format for the features: the features of an object to be classified have to be given as a dictionary. The keys of this dictionary are supposed to be short descriptions of the features, while the values are the feature values. Later, we will try to increase the accuracy of our prediction by adding more features.


In [ ]:
def gender_features(word):
    return { 'ending': word[-2:] }

Let's test this function on the name 'Hugo'.


In [ ]:
gender_features('Hugo')

We have to transform the names in our training set into features in order to train a classifier.


In [ ]:
train_set_features = [(gender_features(n), g) for (n, g) in train_set]
train_set_features[:10]

Now we are ready to train our first classifier. To begin with, we use a NaiveBayesClassifier, which is already predefined in the module nltk.


In [ ]:
classifier = nltk.NaiveBayesClassifier.train(train_set_features)

Let us check whether this classifier can predict the gender of the name Hugo.


In [ ]:
classifier.classify(gender_features('Hugo'))

The classifier has correctly predicted the gender of 'Hugo' to be male. But before we get too excited, we should check the accuracy of the classifier on the training set.


In [ ]:
nltk.classify.accuracy(classifier, train_set_features)

Given that this is our first attempt, an accuracy of 80% is not too bad. After all, so far we are using just a single feature. The question is whether our classifier is able to generalize its predictions to examples it has not seen before. In order to answer this question, we have to use the test set. Again, we first have to transform the names from the test set into features.


In [ ]:
test_set_features = [(gender_features(n), g) for (n, g) in test_set]
nltk.classify.accuracy(classifier, test_set_features)

The performance on the test set is slightly worse, but given that we have a bias of 20%, there is no need to worry about a variance of 2% at this point. Of course, this remark only holds if we assume that the so-called Bayes optimal error is close to 0%. If, instead, the Bayes optimal error were, say, 19%, then we could never achieve an accuracy better than 81%. In that case, the difference of 2% between the training set and the test set would have to be investigated further, because it would then be more promising to reduce this error than to try to reduce the 1% that separates the error on the training set from the best possible error. Of course, initially we do not know the Bayes optimal error. For now, I am just assuming that it is 15% or less.
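To make this reasoning concrete, the following cell computes both quantities from the accuracies we have already measured. This is a minimal sketch that follows the terminology used above, where the training error plays the role of the bias and the gap between training and test accuracy plays the role of the variance.


In [ ]:
# Compute the error decomposition for the current classifier, using the
# feature sets and the classifier defined in the previous cells.
train_accuracy = nltk.classify.accuracy(classifier, train_set_features)
test_accuracy  = nltk.classify.accuracy(classifier, test_set_features)
print('bias     (training error):  ', 1 - train_accuracy)
print('variance (train/test gap):  ', train_accuracy - test_accuracy)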

The NaiveBayesClassifier has a useful method $\texttt{show\_most\_informative\_features}(n)$, which shows the $n$ most important features.


In [ ]:
classifier.show_most_informative_features(30)

For example, this output tells us that a name ending in na is 93 times more likely to be female than male.
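We can cross-check this ratio against the raw data by counting the names ending in na in the list Names. The counts will not match the ratio exactly, because the classifier estimates its probabilities from the training set only and applies smoothing, but they should be of a similar magnitude.


In [ ]:
# Count how many female and male names in the full dataset end in 'na'.
na_female = sum(1 for (n, g) in Names if n.lower().endswith('na') and g == 'f')
na_male   = sum(1 for (n, g) in Names if n.lower().endswith('na') and g == 'm')
print('female names ending in na:', na_female)
print('male   names ending in na:', na_male)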

Refining our Model

Next, our goal is to refine our model for gender classification by adding more features. In order to get a better understanding, let us investigate those names that are misclassified. We have to be careful to look at examples from the training set, not from the test set, for if we design features with respect to the test set, then the test set will no longer give us a reasonable estimate of the accuracy of our model.


In [ ]:
errors = [(n, g) for (n, g) in train_set 
                 if classifier.classify(gender_features(n)) != g
         ]
errors

A first attempt to improve our model is to add the first letter of a given name as a feature. Furthermore, we add features that check which letters occur in a name. Below is the new definition of the function gender_features with these new features. We import the module string because it provides the string ascii_lowercase, which contains all lowercase letters.


In [ ]:
import string

def gender_features(name):
    features = {}
    features["first" ] = name[0].lower()
    features["suffix"] = name[-2:].lower()
    for letter in string.ascii_lowercase:
        features["has(%s)" % letter] = (letter in name.lower())
    return features

Let's test this on our old friend 'Hugo'.


In [ ]:
gender_features('Hugo')

In [ ]:
len(gender_features('Hugo'))

With this new implementation of the function gender_features we have 28 features, which is a lot more than what we had in our first model. But do these additional features actually improve the performance of our model? We can only answer this question if we train the model and test it. Let us compute the features of the training set.


In [ ]:
train_set_features = [(gender_features(n), g) for (n, g) in train_set]

Next, we train a NaiveBayesClassifier with the new features.


In [ ]:
classifier = nltk.NaiveBayesClassifier.train(train_set_features)

First, we check whether the accuracy for the training set has improved.


In [ ]:
nltk.classify.accuracy(classifier, train_set_features)

Next, we check the accuracy on the test set.


In [ ]:
test_set_features = [(gender_features(n), g) for (n, g) in test_set]
nltk.classify.accuracy(classifier, test_set_features)

Obviously, this is an improvement, but it is smaller than we might have hoped for. Let us check the 30 most important features.


In [ ]:
classifier.show_most_informative_features(30)

It seems that the suffix is by far the most important feature. Therefore, we try a brute-force approach and increase the length of the suffix feature to three characters. After all, three is more than two, so this should be an improvement. However, we have to take care of the fact that some names have a length of just two characters. Our new implementation of gender_features deals with this case.


In [ ]:
def gender_features(name):
    features = {}
    features["first" ] = name[0].lower()
    if len(name) >= 3:
        features["suffix"] = name[-3:].lower()
    else:
        features["suffix"] = name[-2:].lower()
    for letter in string.ascii_lowercase:
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [ ]:
train_set_features = [(gender_features(n), g) for (n, g) in train_set]

In [ ]:
classifier = nltk.NaiveBayesClassifier.train(train_set_features)

In [ ]:
nltk.classify.accuracy(classifier, train_set_features)

This looks promising. It seems that we are on the right track. Let's check the test data.


In [ ]:
test_set_features = [(gender_features(n), g) for (n, g) in test_set]
nltk.classify.accuracy(classifier, test_set_features)

In fact, our new features overfit the training data and do not generalize. Hence, we conclude that having a suffix of three characters is not helpful.

My final attempt to improve the accuracy is based on three ideas:

  1. Instead of just using the first character as a feature, we should use the first two characters. After all, we are also using the last two characters.
  2. Often, the way the vowels of a name are connected gives a hint about the gender.
  3. In the same way, the consonants might be helpful. However, we will only use the set of all consonants occurring in a name, not the order in which they appear.
Furthermore, in order to reduce the overfitting we will drop the features that check the occurrence of each character individually.

The function find_vowels$(s)$ takes a string $s$ and strips out all characters that are not vowels. (For our purposes, we count the letter y as a vowel.)


In [ ]:
def find_vowels(s):
    return ''.join([c for c in s if c in 'aeiouy'])

In [ ]:
find_vowels('Hugo')

The function find_consonants$(s)$ takes a string $s$ and returns the set of its consonants. We return a frozenset, because feature values have to be hashable and an ordinary set is not.


In [ ]:
def find_consonants(s):
    return frozenset({c for c in s if c not in 'aeiouy'})

In [ ]:
find_consonants('Hugo')

In [ ]:
def gender_features(name):
    name     = name.lower()
    features = {}
    features["first" ] = name[:2]
    features["suffix"] = name[-2:]
    features["vowels"] = find_vowels(name)
    features["consonants"] = find_consonants(name)
    return features

In [ ]:
train_set_features = [(gender_features(n), g) for (n, g) in train_set]
classifier = nltk.NaiveBayesClassifier.train(train_set_features)
nltk.classify.accuracy(classifier, train_set_features)

In [ ]:
test_set_features = [(gender_features(n), g) for (n, g) in test_set]
nltk.classify.accuracy(classifier, test_set_features)

In [ ]:
classifier.show_most_informative_features(40)

All of our features occur in the list of the 40 most important features. In order to improve our model, we could try to use a classifier that is different from the NaiveBayesClassifier. For example, the MaxentClassifier is more sophisticated than the NaiveBayesClassifier. However, this classifier also takes a much longer time to train.


In [ ]:
train_set_features = [(gender_features(n), g) for (n, g) in train_set]
classifier = nltk.MaxentClassifier.train(train_set_features)
nltk.classify.accuracy(classifier, train_set_features)

This looks like an improvement. Let's check the accuracy on the test set.


In [ ]:
test_set_features = [(gender_features(n), g) for (n, g) in test_set]
nltk.classify.accuracy(classifier, test_set_features)

Actually, the improvement is not real: it is mostly due to overfitting. The same thing happens if we try the ConditionalExponentialClassifier.
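A practical note on the training time: the method nltk.MaxentClassifier.train accepts cutoff parameters such as max_iter that limit the number of optimization iterations. The sketch below caps the training at 10 iterations; this trades accuracy for speed, and the best value depends on the data.


In [ ]:
# Cap the number of optimization iterations to speed up the training,
# possibly at the cost of some accuracy.
fast_classifier = nltk.MaxentClassifier.train(train_set_features, max_iter=10)
nltk.classify.accuracy(fast_classifier, test_set_features)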

Homework: Try to design features that improve the accuracy of the classifier on the test set.
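As a starting point, the following helper, which is our own addition and not part of the NLTK book, trains and evaluates a NaiveBayesClassifier for an arbitrary feature extractor, so that new feature ideas can be compared quickly.


In [ ]:
# Train a NaiveBayesClassifier with the given feature extractor and return
# the accuracy on the training set and on the test set.
def evaluate(feature_extractor):
    train = [(feature_extractor(n), g) for (n, g) in train_set]
    test  = [(feature_extractor(n), g) for (n, g) in test_set]
    clf   = nltk.NaiveBayesClassifier.train(train)
    return nltk.classify.accuracy(clf, train), nltk.classify.accuracy(clf, test)

evaluate(gender_features)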